NSF PAR Search | NSF Public Access Repository

iSeqSearch: incremental protein search for iBlast/iMMSeqs2/iDiamond

https://doi.org/10.7717/peerj.19171

Yoo, Hyunwoo; Refahi, Mohammadsaleh; Polikar, Robi; Sokhansanj, Bahrad A; Brown, James R; Rosen, Gail L (April 2025, PeerJ)

Background: The advancement of sequencing technology has led to a rapid increase in the amount of DNA and protein sequence data; consequently, genomic and proteomic databases are constantly growing. As a result, database searches need to be continually updated to account for newly added data. However, repeatedly re-searching the entire existing dataset wastes resources. Incremental database search can address this problem. Methods: One recently introduced incremental search method is iBlast, which wraps the BLAST sequence search method with an algorithm that reuses previously processed data and thereby increases search efficiency. The iBlast wrapper, however, must be generalized to support better-performing DNA/protein sequence search methods that have since been developed, namely MMseqs2 and Diamond. To address this need, we propose iSeqSearch, which extends iBlast by incorporating support for MMseqs2 (iMMseqs2) and Diamond (iDiamond), thereby providing a more generalized and broadly effective incremental search framework. Moreover, the previously published iBlast wrapper has to be revised to be more robust and usable by the general community. Results: iMMseqs2 and iDiamond, which apply the incremental approach, perform nearly identically to MMseqs2 and Diamond. Notably, when hit rankings are compared using measures such as the Pearson correlation, we observe high concordance (above 0.9), indicating similar results. Moreover, in some cases, our incremental approach, iSeqSearch, which extends the iBlast merge function to iMMseqs2 and iDiamond, returns more hits than the conventional MMseqs2 and Diamond methods. Conclusion: The incremental approach using iMMseqs2 and iDiamond reuses previously processed data efficiently while maintaining high accuracy and concordance in search results. This method can reduce resource waste in searches of continually growing genomic and proteomic databases.
Sample code and data are available on GitHub and Zenodo (https://github.com/EESI/Incremental-Protein-Search; DOI:10.5281/zenodo.14675319).
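The merge step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hit` record, the function names, and the cutoff are hypothetical. It relies only on the general property that BLAST-style E-values grow in proportion to the size of the database searched, so hits found in the old database and in the newly added portion are each rescaled to the combined size before being merged. The published wrappers may handle statistics differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical hit record; real tools report many more fields."""
    query: str
    target: str
    evalue: float


def rescale(hits, part_size, total_size):
    # Assumption: E-values scale linearly with the size of the searched database.
    factor = total_size / part_size
    return [Hit(h.query, h.target, h.evalue * factor) for h in hits]


def merge_incremental(old_hits, old_size, new_hits, new_size, max_evalue=1e-3):
    """Combine hits from a past search with hits from the newly added data only."""
    total = old_size + new_size
    merged = rescale(old_hits, old_size, total) + rescale(new_hits, new_size, total)
    # Re-apply the cutoff, since rescaling can push a hit over the threshold.
    kept = [h for h in merged if h.evalue <= max_evalue]
    # Rank hits per query, best (smallest) E-value first.
    return sorted(kept, key=lambda h: (h.query, h.evalue))


# Usage: the old database is never searched again; only its stored hits are reused.
old = [Hit("q1", "a", 1e-6)]
new = [Hit("q1", "b", 1e-8)]
merged = merge_incremental(old, 100, new, 100)
```

Only the newly added sequences need to be searched; the cost of the merge depends on the number of stored hits rather than on the database size.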
Free, publicly-accessible full text available April 28, 2026
The Naïve Bayes classifier++ for metagenomic taxonomic classification—query evaluation

https://doi.org/10.1093/bioinformatics/btae743

Duan, Haozhe Neil; Hearne, Gavin; Polikar, Robi; Rosen, Gail L; Kendziorski, Christina (ed.) (December 2024, Bioinformatics)

Motivation: This study examines the query performance of the NBC++ (Incremental Naive Bayes Classifier) program across variations in canonicality, k-mer size, databases, and input sample data size. We demonstrate that both NBC++ and Kraken2 are influenced by database depth, with macro measures improving as depth increases. However, fully capturing the diversity of life, especially viruses, remains a challenge. Results: NBC++ can competitively profile the superkingdom content of metagenomic samples using a small training database. NBC++ spends less time training and can use a fraction of the memory that Kraken2 requires, but at the cost of long querying time. Major NBC++ enhancements include accommodating canonical k-mer storage (leading to significant storage savings) and adaptable, optimized memory allocation that accelerates query analysis and enables the software to run on nearly any system. Additionally, the output now includes log-likelihood values for each training genome, providing users with valuable confidence information. Availability and implementation: Source code and Dockerfile are available at http://github.com/EESI/Naive_Bayes.
Semi-Supervised and Incremental Sequence Analysis for Taxonomic Classification

https://doi.org/10.1109/SSCI52147.2023.10371886

Fasino, Adriana; Ozdogan, Emrecan; Sokhansanj, Bahrad A; Rosen, Gail; Polikar, Robi (December 2023, IEEE)

Full Text Available
Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

https://doi.org/10.7717/peerj.14779

Nguyen, Rachel; Sokhansanj, Bahrad A.; Polikar, Robi; Rosen, Gail L. (January 2023, PeerJ)

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, the degree to which related sequences are kept together rather than broken up into multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. Remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related protein clusters that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with respect to the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.
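To make the homogeneity/completeness trade-off in the abstract above concrete, the following is a minimal sketch of the standard entropy-based definitions, with the V-measure as their harmonic mean. It is a generic illustration, not code from Complet+; the function names and the toy labels are assumptions.

```python
import math
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """H(a | b): uncertainty left in labels a once labels b are known."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((nab / n) * math.log(nab / b_counts[bv])
                for (_, bv), nab in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    h_classes = _entropy(classes)
    h_clusters = _entropy(clusters)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - _conditional_entropy(classes, clusters) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - _conditional_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Toy example: family "fam1" is split across two clusters (pure but incomplete).
classes = ["fam1", "fam1", "fam2", "fam2"]
split = v_measure(classes, [0, 1, 2, 2])
# Merging the two fam1 clusters restores completeness without hurting homogeneity.
merged = v_measure(classes, [0, 0, 2, 2])
```

In the toy example the split clustering is fully homogeneous but incomplete, which is the situation a post-processing merge step aims to repair.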
Full Text Available
Semi-supervised and Incremental VSEARCH for Metagenomic Classification

https://doi.org/10.1109/SSCI51031.2022.10022184

Ozdogan, Emrecan; Fasino, Adriana; Nguyen, Rachel; Sokhansanj, Bahrad; Rosen, Gail; Polikar, Robi (December 2022, 2022 IEEE Symposium Series on Computational Intelligence (SSCI))

DNA sequencing of microbial communities from environmental samples generates large volumes of data, which can be analyzed using various bioinformatics pipelines. Unsupervised clustering algorithms are usually an early and critical step in an analysis pipeline, since much of such data is unlabeled, unstructured, or novel. However, curated reference databases that provide taxonomic label information are also growing, which can help with the classification of sequences and not just their clustering. In this contribution, we report on our progress in developing a semi-supervised approach for genomic clustering algorithms, such as U/VSEARCH. The primary contribution of this approach is the ability to recognize both previously seen and novel sequences using an incremental approach: for sequences whose examples were previously seen by the algorithm, the algorithm can predict a correct label. For previously unseen novel sequences, the algorithm assigns a temporary label and then replaces it with a permanent one if and when such a label is established in a future reference database. The incremental learning aspect of the proposed approach provides the additional ability to process data continuously as new datasets become available. This functionality is notable because most sequence data processing platforms are static in nature: they are designed to run on a single batch of data, and the only way to process additional data is to combine the new and old data and rerun the entire analysis. We report our promising preliminary results on an extended 16S rRNA database.
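The temporary-label mechanism described in the abstract above can be sketched as a small bookkeeping class. This is an illustrative assumption about how such labels might be tracked, not the authors' code; `LabelRegistry`, the `TEMP_` prefix, and the cluster identifiers are hypothetical.

```python
class LabelRegistry:
    """Tracks one label per cluster, using placeholders for novel clusters."""

    def __init__(self):
        self.labels = {}      # cluster id -> current label
        self._next_temp = 0   # counter for placeholder labels

    def label_for(self, cluster_id, reference_labels):
        """Return the label for a cluster given the current reference database.

        reference_labels maps cluster ids to curated taxonomic labels; it is
        expected to grow as newer reference releases become available.
        """
        if cluster_id in reference_labels:
            # A curated label exists now: it replaces any temporary label.
            self.labels[cluster_id] = reference_labels[cluster_id]
        elif cluster_id not in self.labels:
            # Novel cluster: assign a temporary label until a curated one exists.
            self.labels[cluster_id] = f"TEMP_{self._next_temp}"
            self._next_temp += 1
        return self.labels[cluster_id]


registry = LabelRegistry()
first = registry.label_for("c1", {})                      # novel, gets a placeholder
again = registry.label_for("c1", {})                      # placeholder is stable
final = registry.label_for("c1", {"c1": "Escherichia"})   # later release names it
```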
Full Text Available
Incremental and Semi-Supervised Learning of 16S-rRNA Genes For Taxonomic Classification

https://doi.org/10.1109/ssci50451.2021.9660093

Ozdogan, Emrecan; Sabin, Norman C.; Gracie, Thomas; Portley, Steven; Halac, Mali; Coard, Thomas; Trimble, William; Sokhansanj, Bahrad; Rosen, Gail; Polikar, Robi (January 2022, 2021 IEEE Symposium Series on Computational Intelligence (SSCI))

Genome sequencing generates large volumes of data and hence requires ever greater computational resources. The growing data problem is even more acute in metagenomics applications, where data from an environmental sample include many organisms instead of just one, as in common single-organism sequencing. Traditional taxonomic classification and clustering approaches and platforms, while designed to be computationally efficient, are not capable of incrementally updating a previously trained system when new data arrive, which then requires complete retraining with the augmented (old plus new) data. Such complete retraining is inefficient and leads to poor utilization of computational resources. An ability to update a classification system with only new data offers a much lower run-time as new data are presented, and does not require the approach to be retrained on the entire previous dataset. In this paper, we propose Incremental VSEARCH (I-VSEARCH) and its semi-supervised version for taxonomic classification, as well as a threshold-independent VSEARCH (TI-VSEARCH), as wrappers around VSEARCH, a well-established (unsupervised) clustering algorithm for metagenomics. We show on a 16S rRNA gene dataset that I-VSEARCH, running incrementally only on the new batches of data that become available over time, does not lose any accuracy compared with VSEARCH run on the full data, while providing attractive computational benefits.
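The incremental idea in the abstract above, processing only new batches while keeping earlier clusters, can be sketched with a greedy centroid-based clusterer. This is a toy illustration and not I-VSEARCH: the class name is hypothetical, and the identity score uses Python's `difflib.SequenceMatcher` as a stand-in for the alignment-based identity that a tool such as VSEARCH computes.

```python
from difflib import SequenceMatcher


def identity(a, b):
    # Stand-in similarity in [0, 1]; NOT an alignment-based sequence identity.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


class IncrementalClusterer:
    """Greedy centroid clustering that can be updated one batch at a time."""

    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self.centroids = []   # one representative sequence per cluster
        self.members = []     # sequences assigned to each cluster

    def add_batch(self, sequences):
        """Assign each new sequence to a cluster; earlier batches are not revisited."""
        assignments = []
        for seq in sequences:
            best_idx, best_score = None, 0.0
            for idx, centroid in enumerate(self.centroids):
                score = identity(seq, centroid)
                if score > best_score:
                    best_idx, best_score = idx, score
            if best_idx is not None and best_score >= self.threshold:
                self.members[best_idx].append(seq)
            else:
                # No centroid is close enough: the sequence seeds a new cluster.
                self.centroids.append(seq)
                self.members.append([seq])
                best_idx = len(self.centroids) - 1
            assignments.append(best_idx)
        return assignments


clusterer = IncrementalClusterer(threshold=0.9)
batch1 = clusterer.add_batch(["ACGTACGTAC", "TTTTGGGGCC"])
batch2 = clusterer.add_batch(["ACGTACGTAC", "GGGGGGGGGG"])  # only new data is processed
```

Each new sequence is compared against the stored centroids only, so the cost of an update depends on the number of clusters rather than on the number of sequences seen so far.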
Full Text Available
Incremental & Semi-Supervised Learning for Functional Analysis of Protein Sequences

https://doi.org/10.1109/SSCI50451.2021.9659958

Halac, Mali; Sokhansanj, Bahrad; Trimble, William L.; Coard, Thomas; Sabin, Norman C.; Ozdogan, Emrecan; Polikar, Robi; Rosen, Gail L. (December 2021, Incremental & Semi-Supervised Learning for Functional Analysis of Protein Sequences)

Full Text Available
